Language modeling for speech recognition of spoken Cantonese

Authors

  • Yu Ting Yeung
  • Houwei Cao
  • Nengheng Zheng
  • Tan Lee
  • Pak-Chung Ching

Abstract

This paper addresses the problem of language modeling for large-vocabulary continuous speech recognition (LVCSR) of Cantonese as spoken in daily communication. As a spoken dialect, Cantonese is not used in written documents and published materials, so it is difficult to collect a sufficient amount of written Cantonese text for training statistical language models. We propose to solve this problem by translating standard Chinese text, which is much easier to find, into written Cantonese. A rule-based translation method is devised and implemented. Three language models are trained on different types of text and evaluated on an LVCSR task. Experimental results confirm that the translated text represents Cantonese spoken on formal occasions, such as broadcast news, well. For colloquial Cantonese, adapting the language model with a limited amount of colloquial Cantonese text is a practically feasible solution that leads to reasonable speech recognition performance.
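The abstract does not give the translation rules themselves. As a minimal sketch of what rule-based translation from standard Chinese text to written Cantonese could look like, the example below applies greedy longest-match word substitution. The rule table `RULES` and the function `translate` are hypothetical illustrations built from a handful of well-known lexical correspondences; they are not the authors' rule set or implementation.

```python
# Hypothetical rule table: a few well-known standard Chinese -> Cantonese
# lexical correspondences. The paper's actual rules are not reproduced here.
RULES = {
    "是": "係",
    "不": "唔",
    "的": "嘅",
    "他": "佢",
    "他們": "佢哋",
    "沒有": "冇",
}


def translate(text, rules=RULES):
    """Rewrite text left to right, preferring the longest matching rule."""
    max_len = max(len(src) for src in rules)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "他們" wins over "他".
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            # No rule applies: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(translate("他們不是學生"))
```

Under these assumed rules, the standard Chinese sentence "他們不是學生" becomes "佢哋唔係學生". A practical system would likely also need word segmentation and context-dependent rules, which this sketch leaves out.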

Similar resources

Automatic Recognition of Cantonese-English Code-Mixing Speech

Code-mixing is a common phenomenon in bilingual societies. It refers to the intra-sentential switching of two different languages in a spoken utterance. This paper presents the first study on automatic recognition of Cantonese-English code-mixing speech, which is common in Hong Kong. This study starts with the design and compilation of code-mixing speech and text corpora. The problems of acoust...

Recent Advances in Cantonese Speech Recognition

This paper describes our recent work on automatic recognition of Cantonese. Cantonese is one of the major Chinese dialects, spoken by tens of millions of people in Southern China and Hong Kong. For isolated Cantonese syllables, a neural network based recognition algorithm has been successfully developed and the most up-to-date recognition results are presented. For continuous Cantonese speech, ...

Spoken language resources for Cantonese speech processing

This paper describes the development of CU Corpora, a series of large-scale speech corpora for Cantonese. Cantonese is the most commonly spoken Chinese dialect in Southern China and Hong Kong. CU Corpora are the first of their kind and intended to serve as an important infrastructure for the advancement of speech recognition and synthesis technologies for this widely used Chinese dialect. They ...

Design, Compilation and Processing of CUCall: A Set of Cantonese Spoken Language Corpora Collected Over Telephone Networks

The design and compilation of the CUCall telephone speech corpora is described in this paper. Speech database is an indispensable resource for research and development of state-of-the-art spoken language technology. These speech recognition systems rely greatly on a huge amount of well-designed and appropriately processed speech data for parameters training. On the other hand, as telephony appl...

Towards Highly Usable and Robust Spoken Language Technologies for Chinese

This paper gives an overview of our research on Chinese spoken language technologies during the past ten years. It covers fundamental acoustic-phonetic studies of spoken Cantonese, speech corpora development, automatic speech recognition and text-to-speech. Currently our focus is on making these technologies more usable for general users who are not speech experts, and more robust for real-worl...

Publication date: 2008